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PREDICTION OF PARTICLE TYPE GIVEN 
MEASUREMENTS OF PARTICLE LOCATION 

By Robert W. Johnson 
Alphawave Research 

The MaxEnt approach to the prediction of particle type given 
measurements of particle location is explored. Two types of parti- 
cle are considered, and locations are expressed in terms of a single 
spatial coordinate. Several cases corresponding to different states of 
prior knowledge are evaluated. Predictions are calculated using the 
expectation value, which is the average of the observable weighted by 
the evidence measure over the parameter manifold. How the method- 
ology extends to more general situations is discussed. 

1. Introduction. A common problem that appears in many guises is 
how to predict the type of some unidentified particle knowing only its loca- 
tion, given a list of locations for where such particles of identified type have 
been found. One example of such a problem is the prediction of whether a 
passerby crossing some particular line in space is male or female based upon 
knowledge of where on the line previous passersby of known gender have 
crossed. The problem is predicated on the assumption that measurements 
of location are inexpensive compared to measurements of classification, so 
that after an initial training set of data has been evaluated some algorithm 
can then be used to predict the classification based only upon a known 
location with some degree of certainty that can be determined. 

A specific case of this problem has recently been addressed by Hall, Park 
and Samworth (2008) using the method of kth nearest neighbor classifica- 
tion. In that paper, an optimal choice of k is evaluated empirically from the 
data, allowing for a nonparametric estimate of the desired probability. In 
this paper, we will examine how a parametric model is integrated over the 
parameter manifold with respect to the evidence measure to yield a pre- 
diction which depends only upon the given location. The MaxEnt aspect 
of the algorithm refers to the use of uninformative priors for the param- 
eters, which nonetheless may be nonuniform for some particular choice of 
coordinate mapping of the parameter manifold. 

The process of inductive reasoning is best described using the language 
of conditional probability theory, as demonstrated by Bretthorst (1988), 
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Durrett (1994), and Sivia (1996). Let us quickly review the notation and 
nomenclature that will be used in this paper. The formal statement of the 
expression for the probability of A given conditions B can be written as 



where A and B can have arbitrary dimensionality; for example, A could be a 
vector of measurements, and B could include both the vector of parameters 
associated with some model as well as any other conditioning statements 
such as the model index. The sum and product rules of probability theory 
yield the expressions for marginalization and Bayes' theorem, 



where marginalization follows from the requirement of unit normalization, 
and Bayes' theorem follows from requiring logical consistency of the joint 
density p A > B = p B < A . Certain names have come to be associated with the 
various factors above, but as Sivia (1996) points out, what one calls a proba- 
bility is irrelevant, as the distinction between p^ and p^ is carried explicitly 
by the differences in the conditioning statements. Nonetheless, the terms 
"likelihood" and "prior" are useful for describing how new data updates 
one's state of knowledge. Instead of the term "posterior" for the estimate of 
the parameter probability we will use "evidence", and the chance of mea- 
suring the data based on no other knowledge will not be named as it is not 
necessary for the normalization of the evidence measure nor for the evalua- 
tion of the relative evidence for competing models. 

2. Definition of the model. For consistency of comparison, we will 
follow as closely as possible the notation used by Hall, Park and Samworth 
(2008). The population of particles decomposes into two classifications, de- 
noted type X and type Y, and the location for each particle type is assumed 
to follow an independent normal distribution in the spatial dimension z. The 
measured locations of the particles identified as type X can be expressed as 
the vector X = Xj for integer j £ [1, J], and similarly Y = for k £ [1, if], 
such that N = J + K gives the total number of classified particles. The lo- 
cation measurements for type X are assumed to be drawn from the normal 
distribution p(Xj | /, /) = f(Xj), where 



(1) 



p(A \B)=p£, 



(3) 



(2) 




(4) 



/(z) = (2vr/ 2 )- 1 /2 exp -i/ 2[ ( / _ z)/ / ] 2 
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and similarly for p(Yk\g,g) = g(Yk), using notation exp a (/3) = e a/3 with 
an economy of brackets when possible. One may alternately interpret that 
equation to mean that the standard deviation of the location measurement 
process is equal to / (or g). The relative likelihood of a particle being a Y 
rather than an X is denoted by the parameter m, such that the absolute 
likelihood of being an X is p^ t = (1 + m) -1 . The parametric model, then, 
for the probability that some new, unclassified datum is of type X knowing 
only its location z is 

(5) p(X &tz\z, m, f, g, f,g) = [l + m ({z)}~ L , 

where C(z) = q(z)/ f(z), and by normalization p Yatz = 1 — p Xat - z ~_. The 

M ' yV " JK ' ^zmfgfg ^zmfgfg 

desired quantity, however, is the estimate of that likelihood given the data 
Pz^njkxy- Inductive reasoning is used to relate these quantities of interest. 

For brevity of notation, any knowledge that can be derived from the con- 
ditioning statements explicitly present will be suppressed, e.g. P^njkx.y = 
PzJLY*- Let us also collect the coordinates of the parameter manifold into 
the contravariant position vector r = [m, /, g, /, g], such that V = d/dr is 
a covariant vector. By marginalization, the desired quantity can be written 
as the expectation value of the observable as a function of the parameters 
weighted by the evidence measure integrated over the entire parameter man- 
ifold, not simply taken from its mode (most likely value), formally expressed 



(6) 



J{r} J{r} 



XY i 



where the conditioning on iV is implicit. 

The evidence measure decomposes into a product of factors according to 

what knowledge is required for their determination Pxy °^ P^jkPxPy > w hich 
themselves factor into products of likelihoods and priors. For example, the 
evidence density for m is py K oc p^p m , and similarly for the remaining 
factors. Addressing first the likelihood factors, the chance of observing K 
out of N particles of type Y given m is 

(7) pi K = m K /(l + m) J+K = m K (l + m)- N , 

whereas the chance of observing locations X given values for / and / is 

(8) Pfj = II /(*) = (2vr/ 2 )-^exp- 1 /^ [(/ _ z) Jff , 

zex zex 

and similarly pj~ = flz&Y d( z )- The likelihood of the data at manifold posi- 
tion r is then given by their product p^ Y = p^pffpY 5 - 
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Next let us look at the prior factors p r = p m pf f p99 . The MaxEnt approach 
to selecting an uninformative prior is based on the principle of indifference 
as represented by the requirement of consistency under various transforma- 
tions of the parameter or data coordinate mappings. Dose (2003) gives an 
excellent description of the process. According to Jaynes (1968), the prior 
suggested by Jeffreys for the parameters of a Gaussian distribution results 
from satisfaction of the functional equation for transformations in location 
and scale, thus pff99 oc (fg)^ 1 - In that paper, he argues that the same func- 
tional form is appropriate for the rate parameter of a Poisson process, which 
will be called n. In the presentation by Hall, Park and Samworth (2008), the 
arrival of the particles at the axis of measurement is assumed to follow the 
Poisson distribution with parameters fi and v for types X and Y respectively, 
thus the total number of particles expected in one unit of time is n = \i + v. 
Writing m = u/fi, the determinant of the Jacobian is det J = n/(l + m) 2 , 
leading to the transformation of the prior p nm = p^ v det J , whereby 

(9) p m cx n 2 (l + m)~ 2 ix~ l v~ 1 = mT l , 

which states that m is just as likely to be between 0.1 and 1 as it is to be 
between 1 and 10 before any observations are recorded. The prior measure 
for the parameter manifold is thus p r oc (mfg) , where the constant of 
proportionality is determined by the limits of consideration. In particular, 
finite limits for m must be symmetric in scale about unity so that the prior 
expectation of finding an X, 

(10) <p£) ro | 

remains equal to 1/2 and so that (m) m | moo = {m~ 1 ) m \m 00 - 

3. Comparison to simpler models. If the parameters /, g, f, and g 
for the Gaussian distributions are known in advance, then only m need be 
estimated from the data. Using the notation q^ = — log p^ , the "parameter 
info" (information content of the evidence density) is 

(11) qJ K = q J m K +q m + C = iVlog(l + m) - {K - 1) logm + C , 

where C is the logarithm of the normalizing constant, equal to log/3(J, K) 
when moo = oo. Its mode may be found from the equation for a vanishing 
gradient 

(12) = d m qf K = mr X {l + m)- x [l -K + m(J + 1)] , 



J l moo 
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Fig 1. Comparison of prob(X at z) as a function of estimated from the expectation 
value (p^m z ) m | jk shown as □ to its approximation shown as O /or J and if as 

indicated above each panel. 



whose solution mo = (K — 1) /( J + 1) may be negative when K = 0, in which 
case the mode is at the lower limit of consideration; if one inverts the identity 
of X and Y, one finds that the evidence for the inverted m has its mode at 
the position given by the analytic formula. In the limit of m £ [0,oo], the 
expectation that some new particle is of type X is (Pm)m\ jk = J)~ x , 

the expectation value for m is {m) m \j K = K/{J — 1), and that for to -1 is 
(m~ 1 ) m | jk = J/(K — 1). When the new particle's location z also is known, 
one may evaluate (p^m Z )m \ jk as the prediction for its type being X; when 
both J and K are large, that value is approximately given by simply using 
the expectation for m in the model for pf^ z , i-e. (pf^ z } m \jK ~P^fm1 ^ or 
J, K ^> 1. In Figure 1 we compare (pf^ z )m \ jk to pf^^) as a f unc tion of 
£(z) for various values of J and K. 

Let us next consider the case where only the deviations / and g are known 
in advance, so that now / and g must also be estimated from the data. For 
convenience, let us further suppose that / = g = 1, setting the unit for the 
locations z. The values of the parameters used to generate the data will be 
denoted ma, fn, and gn, where the subscript ft indicates conditioning on 
the sum of all knowledge, i.e. they are the "true" values unknown to mere 
mortals. According to the model, the actual chance of finding an X at some 
z is given by pf^ &tz = [1 + mn(n(z)]~ l , thus the estimation of that quantity, 
as well as its reliability, from the data at hand is the desired goal of the 
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statistical analysis. The manifold position of course is not allowed to be 
part of that process, as its knowledge would preclude the need to collect any 
data. 

Retaining only m, f, and g in r, the evidence density is 

(13) pkYKP^pJpJp™ 1 ' 9 , 

where the prior is normalized to unit volume, thus the parameter info is 

(14) lkY = 1 J m K + r + qf + qJ + C, 

and the value of C is chosen according to the task at hand. For taking ex- 
pectation values, one convenient choice is that which normalizes the peak of 
the evidence to unity Co = — qo, whereas for hypothesis testing (model selec- 
tion) C must equal the logarithm of the normalizing constant for the prior, 
and the remaining terms must retain any constants found in the likelihood, 
unless they happen to cancel out of the relative evidence ratio. Here, C r = 
log(21ogmoo) + 21ogA 2 when m G [m" 1 ,?™^] and f,g G [-A z /2, A 2 /2]. 
The expression 

(15) qf + ^ = f log(2vr) + \ £(/ - zf + \ J2(9 - zf 

gives the additional likelihood info coming from the location measurements, 
and its mode is easily found to be at /o = {Xj)j and go = (Yk)k- 

Because the preferred locations / and g are themselves estimated, one 
must ask whether the data would be more efficiently represented by a single 
Gaussian, which we will call h{z) with mean h. The relevant factors in 
the relative evidence ratio are those which do not depend on m, and the 
problem reduces to the well-known example of whether the difference in 
means between two populations is statistically significant. The answer is 
given by the ratio of the expectation values for the likelihood of each model, 
(16) 

f 9 _ <Pf?)fs _ IX% 1-^/2^ dfdg (pjpJA f 2,N y/ 2 

Ph ~ (pf Y h f%%Pt^dh w ^ ) w,jk) ' 

where the first factor is the ratio of peak likelihoods (when the prior is 
uniform), the second (Occam) factor is the ratio of the filling fractions for 
each model, and the approximation results from taking infinite bounds in 
the integrals. The filling fraction for a model is a number between and 
1 which indicates how much the evidence fills the parameter manifold with 
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Fig 2. Comparison of the numerical estimate for (p m )m\jK displayed as to that for 
(Pzi- at z )r | xy displayed as □ using limits of integration given by A z = 12 and xo as 
indicated above each panel, with the analytic value for (p m ) m \ jk displayed as O- 



respect to the prior measure. If the relative evidence ratio is well below unity 
pjf -C 1, then probability theory is telling one to neglect the z dependence 
and simply use m as the basis for any prediction regarding the type of some 
unidentified particle. If either J or K equals 0, attention to factors reveals 
that pjf = 1 in that C9/SG , clS the evidence density is uniform along the 
irrelevant parameter. 

Using finite limits for numerical integration can have an impact on the 
result. Recalling that conditioning on N has been implied throughout, let 
us compare the estimate of (p n \} m \jK to (p^ at2 ) r |XY f° r the two cases of 
m ro = iV+1 and = oo. The former represents a prior state of knowledge 
in which one is certain to observe particles of both types eventually, even 
though particles of only one type have been observed so far. The numerical 
integration is more easily accomplished upon a change of variables x = 
(1 + m) _1 such that 

/•moo fl— Xq 

(17) / m K ~ 1 (l+m)- J - K dm= x J ~ 1 (l — x) K ^ 1 dx < (3(J, K) , 

with equality in the limit xq — > 0. One way to assess the reliability of one's 
estimate is to inspect the ratio of the numerically integrated evidence volume 
to that derived analytically for the infinite manifold, identified as Panf 1 - 
Figure 2 shows the comparison of the estimated operators for values of N = 3 
and N = 5 using fn = l and gn = —1, as well as the analytic expression for 
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Fig 3. Comparison of the analytic value for {Pm}m\JK displayed as O to the numerical 
estimate for (p^. at z ) r \ xy displayed as □ for values of N as indicated above each panel, 
with their weighted mean p^, at z displayed as and the underlying distribution p^ at z 
displayed as x . Also shown are the values for the parameter mode ro. 



(Pm)m\jK when moo = oo. With only a few measurements, the numerical 
estimate for (pf r atz } r \x.Y approaches the analytic value for (p n \)m\jK onr y 
when moo is very large. After more measurements have accumulated, the 
limits on m have less impact; however, if the mode in m is close to the 
numerical limit, the estimate may still be inaccurate. 

To collect the estimates from the two possible models into a single predic- 
tion, one simply averages them with weights given by their relative evidence, 

(18) pg"" = (^(pf r at2 ) r |XY + ( Pm )m\JK)/(p f h 9 + 1) • 

As evidence accumulates in favor of the f/g model, their average quickly 
approaches the estimate from just that model, as seen in Figure 3. Using 
values of m-n = 1, /n = —gn = 1 and integration limits of A z = 12 and 
xq = 0, it takes only a few tens of measurements before the evidence for the 
null hypothesis (the h model) is negligible. As the number of measurements 
in the training data grows, the estimate p^ p atz converges to the underlying 
O, distribution for the chance of finding an X at z. 

So far we have said nothing about the rate of convergence of the estimate 
(Pzr atz )r\ xy as a function of the number of measurements N. Partly that is 
because we have summarized our inference about p^xy* ^ n ^° a sm §l e number, 
the expectation value of the observable p^ r atz , and to answer the question 
of convergence requires keeping track of two numbers for the inference, rep- 
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Fig 4. Comparison of the analytic value for {pm)m\jK displayed as O to the numerical 
estimate for (p^. atz ) r | xy displayed as □ for values of N and fn = —gn as indicated above 
each panel, with their weighted mean p^, at z displayed as () and the underlying distribution 
p*Q tz displayed as x . Also shown are the values for the parameter mode ro. 



resenting for example its central location and its width. That procedure will 
be addressed later in this article. While we have blithely displayed p^Q tz in 
the preceding figure, one should never forget that its knowledge is beyond 
the ken of mortals. Practically speaking, one simply must collect a sufficient 
amount of data such that collecting more data no longer significantly influ- 
ences the estimate, implicitly assuming that the underlying physical process 
is stationary in time. 

The ratio of the the deviation in the data / to the separation of the 
preferred locations f — g affects how much data is necessary for convergence 
of the estimate. When that ratio is small, not many measurements are needed 
before the null hypothesis is discounted, and further measurements serve 
only to improve the convergence. However, when that ratio is large, the null 
hypothesis can be discounted only after a sufficient number of measurements 
have been taken so that the parameter evidence is well resolved. In Figure 4 
we show the same estimates as in Figure 3 but for N equal to 10 and 20 and 
/n = — <?q of 2 and 1/2. Even when the mode ro gives a poor reckoning of 
the underlying process, the expectation value (p^ atz ) r | xy is fairly accurate 
when pjf 3> 1. To make from pj^ a number comparable to the P (or Q) 
value of frequentist methods, one would state that the null hypothesis (h 
model) is discounted at the level of (pjf + l) -1 . 
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4. Integration over unknown deviation. Let us now consider the 
more realistic situation where the deviations of the location distributions are 
not known in advance. For convenience, let us assign them all the domain 
of a G [cto> Coo] with prior p a oc <t _1 . The normalization constant is A\ oga = 
log (Too — log do, which equals infinity if either <7oo = oo or do = 0. The first 
task is to evaluate the relative evidence of the models, 
(19) 

jA z /2 f A,/2_ fCToo r<x> f-j-iz-K-1 ^-1/2(^2 + ^ 
^ h • • rA z /2 raoo Z^N 



fa _ fXh IX/2 C C r J - l r K ~ l ^ 2 {x) + xj) j/gj/g 



where xj = SzexK/ ~~ z )//] 2 anc l similarly for x 2 , and x\i which can be 
approximated as before by taking the Gaussian integrals over infinite limits 
yet retaining the finite normalization to yield 

f9 „ ( 2,n \ ^ z: c ' K ^- 1/2 (e f /p + e a m dm 

{ >Ph ~W z JK) A loga f^h-"e W -V\e h /h 2 )dh 

where £j = J((Xj)j — (Xj)J) and similarly for £ 2 and The remaining 
integrals can be evaluated analytically to give the result 

m) J 9 „ (*N/JK)V*$- 1 Ar(Zf)Ar(Z g ) 

h 2A 2 A logCT ^- 1 ef- 1 A r fe) ' 

where Ap is defined in terms of the upper incomplete gamma function 
T(a,z) = *°- 1 e- t dt, such that 

(22) MW-r(^,|- 

and similarly for Ap(£ 9 ) and Ar-(£h). Note that this formulation makes no 
use of the peak evidence ratio but instead is expressed entirely in terms of 
the data and the limits of the prior. 

To recover the peak evidence ratio (and thus the Occam factor), one 
needs to evaluate the model evidence densities at their mode positions. The 
parameter info for the X data is now 

(23) qg = (J + 1) log/ + (2/ 2 )- 1 £(/ - zf + C , 

zex 




whose gradient is given by 
(24) Vqff = r 3 



/£, 6 x(/-*) 



v ■ i)/ 2 >:,x(/ 2 
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which vanishes at the mode Vg£ ■ (/o,/o) = 0. As before, the mode in / is 
at the mean of the locations /o = {Xj)j, and the mode in / can be written 
as /o = (J + l) -1 / 2 ^/. The position of the mode for the remaining Gaussian 
parameters is found similarly. 

Let us briefly discuss the limits of integration hence the normalization of 
the prior. If the prior is not to be based upon the current crop of measure- 
ments, where does the information for the limits come from? The practical 
answer is that the limits are determined by the nature of the measurement 
apparatus. Any set of measurements collected within a finite span of time 
necessarily are limited by the range and resolution of the device used for their 
collection, for example measurements of the voltage of a circuit collected by 
a common voltmeter. As long as no measurement "pegs the needle" one 
can safely use limits based on the range of the device; those that do can be 
addressed through an appropriately modified contribution to the likelihood 
beyond the scope of this article. Similarly, the resolution of the device (or 
the width of the particle) sets a lower limit on what can be said about any 
measured deviations in the population locations. For the evaluation of the 
evidence ratio pjf as well as /ojjjjf 1 , we will set the limits for the deviations 
as o"oo = A z and o~o = 10 _4 A Z , with A z = 12 and xq = as above. 

The evaluation of the expectation value of the observable (pf r at z ) r |XY 
proceeds as before, only now the integration is over a 5 dimensional param- 
eter manifold. As Press et al. (1992) state, "integrals of functions of several 
variables, over regions with dimension greater than one, are not easy." For 
problems of Bayesian inference, the majority of the contribution to the inte- 
gral comes from a region localized around the peak of the evidence density 
when sufficient data exists that the limits of the prior are irrelevant. Luckily, 
for this problem the evidence mode is unique and analytic, so that one may 
select limits for the numerical integration much tighter than those given by 
the prior while still encompassing 99.9% of the normalized evidence density. 
The evaluation is performed using an adaptive grid algorithm (Genz and 
Malik, 1980; Berntsen, Espelid and Genz, 1991) over a small region of the 
manifold centered on the position of the optimal parameter values. 

In Figure 5 we compare the expectation of the observable {pf r atz ) r \ xy to 
that given by evaluating the observable using the parameter mode X Y > 
with the results given in Table 1. As the number of measurements increases, 
those estimates draw closer, according to the narrowing of the peak in the 
evidence density. The estimate from the expectation value is "more conserva- 
tive" than that from the mode, in that it is closer to the estimate (Pm)m | xy 
(not shown). While the estimate from the mode more closely resembles the 
underlying distribution pf^ tz , in a real world situation we are not privy 



12 



R. W. JOHNSON 



1 

0.8 
0.6 
0.4 
0.2 
Ol 

1 

0.8 
0.6 
0.4 
0.2 




N = 50, J = 23, K = 27, = 1 







4 



N = 50, J = 23, K = 27, pSi[=1 



TOT 



T" 



I 

X 



X 

t 



N = 50, J = 24, K = 26, p™, = 0.998 



0.8 
0.6 
0.4 
0.2 
0! 

1 

0.8 
0.6 
0.4 

I 

0.2 

o: 



N = 50, J = 22, K = 28, P^[=_L 
— I 1 1 



Fig 5. Comparison of the expectation value (p^- atz ) r |XY displayed as □ to the estimate 
from the parameter mode pf^'xy displayed as O for N = 50 as well as the underlying 
distribution pfn tz displayed as x with parameter values as given in Table 1. 



to that knowledge (which would obviate the need for a statistical analysis). 
The expectation value (p^. atz ) r |XY summarizes what the data have to say 
about the observable into a single number. Using P^jxY as a P rox y f° r the 
mode of the observable (which is not quite the same thing as the observable 
of the mode), one could ascribe to pf r atz a beta distribution according to 
its mean and mode; the maximum likelihood estimate of the parameters of 
the beta distribution for the observable p^ r at 2 requires evaluation of the two 
observables (logpf r atz ) r | X Y and (log(l - pf r ^ z )) r \ xy- 



Table 1 
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Fig 6. Comparison of the expectation value (p^- atz ) r |XY displayed as □ to (pm)m |xy 
displayed as O for various N as well as the underlying distribution p^q 1 z displayed as x 
with parameter values as given in Table 2. 



5. Comparison to non-parametric classification. Let us now look 
at how the MaxEnt prediction compares to those derived from some non- 
parametric schemes commonly employed for this type of problem. For this 
section we will use the same data for each method generated using parameter 
values rriQ = 1, /n = — gn = 2, and fn=gn = 2 for various N. We will 
consider both a nearest neighbor classification scheme which produces its 
estimate from a subset of the data "close" to the desired location as well as 
a classification scheme based upon a kernel density estimate of the identified 
particle distributions. The MaxEnt estimates from these sets of data are 
shown in Figure 6 with parameter modes in Table 2. 

Starting with nearest neighbor classification, its prediction for the type 
of some new particle, 

(25) p£3& = k x /{k x + k y ) , 

is conditioned on the number of neighbors K = Kx + chosen to be influ- 
ential, denoted here as k = p K N for < p K < 1 such that k is an integer. 
The selection of k is arbitrary, but Hall, Park and Samworth (2008) describe 
a method for choosing its value based upon bootstrap estimates from the 
data. Here, however, we are interested in the case where there is so little 
data that bootstrap estimates are unlikely to be reliable. Consequently, we 
will consider the set of ratios p K £ [1/8, 1/4, 1/2, 1]. Another distinction is 
that they assign a type of X or Y to the new particle at z according to 
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Table 2 

Parameters corresponding to Figure 6 
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49 
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2.11 



whether P^xy * s S rea ^ er or ^ ess than 1 /2 rather than expressing the chance 
of finding an X at probability. 

The results of the preceding method are shown in Figure 7. One feature of 
the nearest neighbors method is that its prediction is quantized in units of 
1/k, which can lead to large jumps in the estimate when there is not much 
data. These jumps yield an estimate which is not smooth as a function of z, 
even as the number of measurements approaches 100. By basing its predic- 
tion on the rank of the distances from the data to the desired location, this 
method throws away information pertinent to the analysis. Consequently, its 
prediction is a coarsely grained representation of the underlying distribution, 
even for moderately large sets of data. 

Alternately, we can consider a classification scheme based upon a ker- 
nel density estimate of the particle distributions by type. Its prediction 
PzAXY ^ s conditioned on the bandwidth parameter A for the kernel reso- 
lution. The kernel basis chosen is that given by the Gaussian distributions 
5\(z) = exp -1 / 2 (2i 2 /A 2 ), with peaks normalized to unity for convenience 
later. The kernel density estimate for the distribution of type X is equal 
to the sum of the kernel basis functions centered on the datum locations 
/ax (z) = J^j $\( z ~ Xj), and similarly for gxv( z )- The classification predic- 
tion is then determined from the ratio 

(26) PzAXY = /ax(*)/[/ax(*) + 9xy(z)} , 

using bandwidth parameters of A £ [1/4, 1, 4, 16]. 

The results for the preceding method are shown in Figure 8. Again, large 
jumps are apparent in the estimate when the kernel width is much smaller 
than what we know the particle deviations to be, but the values are con- 
tinuous rather than being quantized. The effect of the kernel basis is that 
of a smoothing filter which spreads the information from each datum over 
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Fig 7. Comparison of the nearest neighbor estimates p^^xi f or various k given by k/N £ 
[1/8, 1/4, 1/2, 1] shown as □, O, 0, aid A respectively. 



a range of nearby locations according to A. In essence, with this method 
one is convoluting the data with some point spread function to produce an 
estimate of the observable at all locations z based on the discrete list of 
measured locations for classified particles. 

The term "non-parametric" is actually a bit of a misnomer, as the eval- 
uation of either method requires specification of a parameter representing 
the bandwidth of the resolution filter. While the bandwidth in z for the 
kernel density prediction is constant for a given A, for a given k that for the 
nearest neighbor prediction is not, based as it is on the rank of the distances 
in z rather than their values. In the limit of infinite bandwidth, such that 
the z dependence disappears, both models give a prediction equal to that 
of the Bayesian estimate (Pm)m | xy = J /{J + ^Q> as indicated by A in the 
figures — the reason for the peak normalization of the kernel basis is so the 
kernel density estimates equal J or K for all z is this limit. The selection 
of the optimal value of the bandwidth parameter requires definition of some 
metric for its merit, introducing yet another source of subjectivity into the 
methodology. In contrast, the only arbitrary elements of the MaxEnt method 
are the limits of the prior, everything else following from repeated applica- 
tions of the rules of probability theory to the state of knowledge specified 
at the outset, and even those are not truly arbitrary when one considers the 
physical nature of the apparatus and the objects. 
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Fig 8. Comparison of the kernel density estimate predictions JJ^axy f or various A given 
by A 6 [1/4, 1,4, 16] shown as □, O, 0, and A respectively. 

6. Model mismatch. What happens when the model used for the 
analysis of the data does not correspond in functional form to that of the 
underlying physical distribution? Without insight into the true nature of the 
objects measured, there is a priori no reason to suppose that some function 
chosen to resemble the data corresponds to that of the underlying physics. 
Specifically, let us consider a distribution for the location z < 4> of particles 
of type X given by 



where Zf = <f>— z, and similarly for gn(z g | 7) using z g = z — j, which one may 
recognize as a gamma distribution with a mean of 4> 2 and a variance of (j) 3 
reflected in z and offset by (p. In panels (a) and (b) of Figure 9 are histograms 
of the locations of X and Y particles drawn from such a distribution with 
run = 1, 0n = 70 = 1.5, and 4>n = -jq = 4. 

Using the model of the preceding section, we get the estimates for the 
observable (p^ at2 ) r |XY an d P^jxY shown in panel (c) of Figure 9. While 
the observable resembles the physical distribution in the region of overlap 
between the particle types, the model is not designed to handle the case 
of finite boundaries in the location distribution. For the unidentified par- 
ticles found outside that region, this model will almost always make the 
wrong prediction. One's suspicion might be aroused by noticing that the 
only measurements in the extreme regions are in contradiction to the model's 



(27) 
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Fig 9. Histograms of location for type X in (a) and Y in (b) with a bin width of 0.5 for a 
set of 100 particles which yield the predictions (p^ at z ) r | xy displayed as □ and P^jxy 
displayed as O using the Gaussian distribution in (c) and the gamma distribution in (d) 
as well as the underlying distribution p*q 1 z displayed as x . 

estimate. Of course, simply looking at the histograms reveals that the sym- 
metric Gaussian model is not going to be the best fitting distribution for 
the identified particle locations. 

Suppose now that somehow we gain knowledge of the functional form for 
fn(z) and gn(z), for example by learning that the particles are racquetballs 
hit out of a tunnel such that X balls are bounced off the right wall and 
Y balls are bounced off the left, with the measurements for the z locations 
taken some distance away from the outlet of the tunnel. The parameters 0n 
and 7q are assumed to be known from the geometry of the apparatus, with 
the parameters r = (m, 4>, 7) to be determined. The new parameter evidence 

* s ^ < ^~ 1 rizex f( z f) wri en the prior p& oc <j)~ l is used, and similarly 
for 7. To remain finite as Zf — > 0, one requires (j) > 1, and the mode value 
4>o is easily found numerically, yielding the mode estimate p^r^ z > as shown 
in panel (d) of Figure 9. An upper limit of eft < 4 is consistent with the 
measurements, and the expectation of the observable (p^ r atz ) r \x.Y can be 
evaluated, also shown in panel (d). Having the correct physical model, even 
if its parameters are undetermined, is certainly an asset when attempting 
to use measurements to make predictions. 

The point of this section is to emphasize how important one's knowledge 
of the situation is to the determination of one's results. The model one 
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selects should be based upon as much information as is available. When no 
single model presents itself as being physically correct, one must consider the 
alternatives, most often after having looked at the data — if the data look like 
an exponential decay, there is not much point in fitting a Gaussian. The prior 
for its parameters likewise should draw upon that background knowledge yet 
remain as unbiased as possible towards the final outcome. What Bayesian 
methods provide is a systematic framework for the comparison of models 
which all do a reasonable job of fitting the data, especially when the number 
or quality of measurements is low. The MaxEnt principle supplements Bayes' 
theorem by providing a systematic framework for the evaluation of the least 
biased prior based upon similarity transformations of the model with respect 
to the data. 

7. Unknown means and deviations with measurement uncer- 
tainty. Returning to the use of the Gaussian model for the particle loca- 
tions, let us now suppose the even more realistic situation where the location 
measurements are themselves subject to Gaussian deviation a, presumed to 
be known from calibration of the measurement device, which for convenience 
will be set equal to the unit for z such that a = 1. The particle locations 
are supposed to be drawn from independent distributions as before, whose 
parameters are to be determined. The model selection ratio can be written 
in terms of the relative model likelihoods as 

/ oq\ jg = (pjfhl&s'a PfP g 

[ ] Ph ~ (Pf?k~ h ~ A 2 A logCT p, ' 

with the prior limits notated as before. Focusing on the model for type X, 
each datum likelihood must now be expressed as an integral over all possible 
values of Xj according to the resolution of the apparatus, 



(29a) V % = f(z S )p z J Xj 

3 1 J -oo 



dZj 



-oo 

/oo 
(27r/)- 1 exp- 1 /2 {[(/ -_ Zj)/ ff + ( Xj - zjfjdzj 
-oo 

(29c) = [27r(l + / 2 )]- 1 / 2 exp- 1 /2[ (/ -_^. ) 2/(i + /2 )] _ 

The parameter evidence is now given by 

(30) p& cx /- 1 [2vr(l + f 2 )]- J /" exp-^Kf - 2f(X j ) 3 + (X 2 ),)/(l + P)] , 
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retaining explicitly the normalization of the likelihood but not the prior. 
The integral over / proceeds as before, 

i>oo 

(31a) p-U = / P&df 

J —CO 

(31b) oc f-\2n(l + p)f- j y\j-^ex p -^[e f /(l + h] , 

yielding the marginal evidence for /. Under a change of variable a 2 = 1 + f 2 
such that f~ 1 df = a(a 2 — l)~ 1 da, the marginal evidence for / can be 
rewritten as 

(32) K x oc (2vr)( 1 - J )/ 2 J- x / 2 (l - d' 2 )-^^ J exp-^^/a 2 ) , 
thus the remaining integral over a can be written as an infinite series, 

oo 

(33) p f « (27r)( 1 - J )/ 2 (8J)- 1 / 2 ^ 2 ( J+2Q )/ 2 ^- J - 2Q A r (a,^) , 

where the approximation results from finite A 2 and Ar now depends on a 
as well as the limits of integration <7o,oo = (1 + c>o,oo) 1//2 > 

/ J + 2a-l & \ /j + 2a-l ¥ 



(34) A r (a,0) = r ,^ 



/ 

'2a 2 



The relative likelihoods p g and p^ are evaluated similarly. 

The gradient of the parameter info, 
(35) 

J(l + / 2 )- 1 (/-(^) J ) 

f-l + J(l + /*)-!/[! _ (1 + /2)-l (/ 2 _ 2 f {Xj)j + {X 2 }j)] _ , 



yields the same mode for the central location f$ = When that value 

is substituted into the expression for <9^ x , the equation for the mode of 
the deviation becomes 

(36) = (J + l)# + (J + 2-0$ + l, 
whose root minimizes its contribution to the parameter info, 

(37) f = min [log / + log(l + f 2 ) J ' 2 + f f /2(l + f 2 ) 
The remaining modes go and ho are found similarly. 
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Fig 10. Comparison of the expectation value (p^-?* Z ) T \ ct xy displayed as □ to t/ie estimate 
from the parameter mode pf^'^xY displayed as O as «;eZZ as the underlying distribution 
Pza-n z displayed as x uiito, parameter values as given in Table 3. 



The uncertainty in the measurement apparatus must also be taken into 
account when evaluating the observable. Given some measured location z 
for an unidentified particle, what we know is that its actual value zq is 
distributed around z with deviation a. Consequently, the integration is now 
over not only the 5 dimensional parameter manifold but also over all possible 
values of zq, 

(38) vlt^= [ AZ/2 !%%&p£dz a = [ AZ/2 fp&' a J&xYl%drdz a , 

J-A z /2 J-A z /2Jr 

which is evaluated numerically as before. The restricted limits of integration 
are found for the independent submanifolds such that the net integration 
measure is approximately normalized, p^f 1 ~ 1. 

In Figure 10 we show the results of such an integration for N particles as 
indicated above each panel generated using parameters found in Table 3 with 
A 2 = 60. We can see that, despite the additional dimension of integration, 
the results are quite similar to what we had before. That should be no 
surprise, as the normal distribution has the feature that its mean and mode 
are at the same value. If during the calibration procedure a resolution (point 
spread) function other than Gaussian is determined for the apparatus, it 
should of course be used instead. 
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Table 3 

Parameters corresponding to Figure 10 
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8. Reliability of the estimate. So far we have danced around the 
topic of determining the reliability of the estimate strictly from the measure- 
ments at hand. In all the previous figures we have displayed the underlying 
physical distribution as a means of establishing that the method does indeed 
approach the "true" value as the number of measurements increases. How- 
ever, in the real world, that knowledge is beyond our ken; we must make do 
with what the data have to say for themselves. That information is encoded 
in the evidence density for the model parameters given the measurements 
and the resolution of the apparatus. 

When sufficient data exist that the significant evidence is restricted to 
some tiny region around a mode that barely moves as more data is col- 
lected, then we may as well call the prediction from the mode our single 
best estimate of the underlying distribution, pfj^ 2 ~ pfjn*- ^ n that case, 
the chance of making the correct prediction for some new datum is given by 
a simple truth table. Using the notation xq = pf a Q Z , we have 
(39) 

p(pf a ll z ''is correct) sa Tr 

which indicates that even knowledge of the physical distribution does not 
guarantee a certain prediction for the particle type of the new datum; to 
be absolutely certain of the new particle's type for any location, one must 
measure its classification. The error rate is greatest at the location where 
the particles appear with equal likelihood, as the chance of a successful 
prediction there is 1/2. When the mode of the model parameters is extremely 
well determined by the data, the error rate lies between and 50% according 
to the value ofp^ tz . 

While the prediction from the mode is the most likely contribution, the 
prediction from the mean of the observable is what encodes our best inference 
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about its value into a single number. If we want to know more about the 
observable, we have to do more work. As mentioned earlier, to describe the 
distribution around the expected value of the observable at some location, 
we need an additional parameter for its width and some function for its 
shape. The natural distribution for unit normalized positive quantities is 
the beta distribution p x ah = x a_1 (l — x) fe_1 //3(a, b) that we have encountered 
in various guises already. Here, the problem is to determine its parameters 
a z and b z given empirical knowledge of the distribution of x z = pfjxY • 

One can easily show that the maximum likelihood estimate of the param- 
eters is given by the solution of the system of equations 

(40a) Ai(a z ) - Ai(a z + b z ) = {\ogx z ) , 

(40b) Ai(6 z )-Ai(o z + 6,) = (log(l - x,)> , 

using the notation Afc(r) = (d r ) k logT(r) for the polygamma functions with 
integer order k and real argument r, which selects the parameters for the beta 
distribution whose values of (logx^) and (log(l — x z )) equal those estimated 
from the empirical measurements. We set aside (for now) the question of 
whether a nonuniform prior p ab should be incorporated on the grounds that 
a lot of effort will be put into converging the integrals such that the prior 
should have little effect. 

The task now is to evaluate those expectation values conditioned on the 
given set of data. Instead of a single observable at each displayed location, 
there are now two to calculate, the first given by 

/■A./2 r 

(41) (logx z ) = - / / log[l + m((z n )]p r aXY p z £ drdzn , 

J-A z /2 Jr 

and the second given by 

/■A./2 r 

(42) (log(l - x z )) = {\ogx z ) + / / logim((z n )Kx Y p z z yrdzn • 

J-A z /2 Jr 

From those two estimates one finds the corresponding a z and b z according 
to Equations (40a) and (40b) above. The integral over zn in the second term 
can be expressed analytically when taken over infinite limits, 

(43) J log[m((z u )}p z £dz Q = log I \ + y - ^ ^ , 

thereby reducing the amount of effort required for its evaluation. 
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Fig 11. Comparison of the expectation value of the observable (x z ) displayed as □ to its 
mode XzO displayed as O as well as the expected chance that a prediction based on (x z ) is 
correct displayed as + with parameter values as given in Table 4- 



Having found a z and b z , the mean of the observable x z may be determined 
according to (x z ) = a z /(a z + b z ), and its mode is given by x z o = (a z — 
1) / (a z + b z — 2) when a z , b z > 1 else may be found at or 1. When using the 
expectation value to make predictions, the chance of a successful prediction 
is itself given by an expectation value, 

(44) {p(x z is correct)) = (x n x z ) + ((1 - x n ){l - x z )) rj (x z ) 2 + (1 - {x z )) 2 , 

where the approximation results from using x z as a proxy for xq. As promised, 
the Bayesian methodology has delivered an estimate of the observable used 
for predicting the type of unclassified particles as well as an estimate of its 
error rate based entirely upon the data at hand. In Figure 11 we show the 
mean and mode of the observable x z , as well as the chance of a correct pre- 
diction according to Equation (44), for iV particles as indicated above each 
panel generated using parameters found in Table 4. 



9. Discussion. The purpose of the preceding exercises is to demon- 
strate how Bayes' theorem and the principle of maximum entropy are used 
to extract meaningful information from a set of data. When the observ- 
able representing the desired knowledge is not one of the parameters of the 
model, then the single best inference of its value is given by its expectation 
taken over the parameter manifold and weighted by the evidence at each pa- 
rameter position. While the estimate from the parameter mode is the single 
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Table 4 

Parameters corresponding to Figure 11 
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most likely contribution, it is the mean of the observable which takes into 
account how much evidence there is in the data for other possibilities. When 
the MaxEnt principle is used to find the invariant measure for the parameter 
manifold, the results of a Bayesian analysis are the least biased possible, as 
each datum updates the evidence density according to its contribution to 
the available information which starts with the statement of the geometric 
properties of the model with respect to the data. 

One feature of Bayesian data analysis as expressed in the language of 
conditional probabilities is that it forces one to specify the state of back- 
ground knowledge upon which any estimate is based. Whenever real data 
are discussed, such statements necessarily include remarks on the nature of 
the measurement apparatus. That apparatus can be in the form of either 
a physical device such as a voltmeter or an abstract device such as a sur- 
vey. While the parameter z has been given the interpretation of a spatial 
coordinate, it represents any observable that can be expressed in terms of 
a location-type parameter with uniform measure over an arbitrarily large 
domain. That parameter can of course be generalized to a data vector of 
any dimensionality according to the complexity of the situation at hand. 

When one is not certain that the background knowledge includes speci- 
fication of the physically correct functional form for the constituent distri- 
butions, there is a formal procedure for comparing the relative likelihood 
of competing models. That comparison is expressed not by the ratio of the 
peak likelihoods given by each model's parameter mode but rather by the 
ratio of the expected likelihoods given by the integral of the evidence over 
each model's parameter manifold. The distinction is important, as it is the 
latter which takes into account the principle of efficiency through the Oc- 
cam factor. The most useful form of the parameter evidence density is that 
which drops the normalization of the prior yet retains any constant factors 
in the likelihood, as its integral appears as both the relative likelihood of the 
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model and the normalization constant for taking expectation values of ob- 
servables. Care must be taken when dropping constants from an evaluation 
(whether numerical or analytic), as it is easy to make a mistake regarding 
the appropriate normalization for the given task. 

When not all of the prior normalization factors cancel out of the model 
comparison ratio, then they must be accounted for explicitly. Quite often the 
MaxEnt prior will yield an infinite normalization when allowed to extend 
over an infinite domain, which is no less a problem for a maximum likelihood 
analysis that assigns any and all parameters a uniform prior. In that case, 
one is forced to consider more fully the nature of the apparatus as well as 
of the objects to be measured in order to determine sensible boundaries for 
one's prior state of knowledge. One generally hopes that finite boundaries do 
not truncate significantly the evidence density, but there can be times when 
physical constraints (such as positivity) dictate that what would otherwise 
be the mode lies outside the allowable parameter domain. 

The process of inductive reasoning, through which one expresses one's de- 
gree of belief in the value of some observable, is not restricted to the simple 
distributions considered here but extends naturally to much more compli- 
cated situations. By going through the development of the methodology for 
increasingly less restrictive states of prior knowledge, we hope that readers 
have gained some insight into how to apply the method to problems they 
encounter. While almost everyone is comfortable with the process of finding 
the parameter mode, and most with taking expectation values according to 
the evidence density, the process of model selection often seems mysterious 
until one realizes that the relevant factors are just the relative likelihoods 
and any leftover prior normalizations. 

10. Conclusion. In this article we have explored the MaxEnt approach 
to the problem of predicting the type of some unclassified particle given its 
location and a list of locations for particles of identified classification. The 
process of inductive reasoning is used to relate these quantities of inter- 
est, which incorporates both Bayes' theorem and the principle of maximum 
entropy to produce the least biased estimate from the data at hand. The 
expectation value of the observable is compared to that derived from the pa- 
rameter mode, and they converge on the underlying distribution when suf- 
ficient data exists and the model function is known to be physically correct. 
When competing models must be compared, the procedure of evaluating the 
evidence ratio in terms of the relative likelihoods and prior normalizations 
has been explicitly determined. 

The prediction from the Bayesian method is found to be superior to those 
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from non-parametric algorithms, such as nearest neighbor or kernel density 
estimation, especially when not much data exists for analysis. The reason 
is because conditional probability theory makes full use of all the available 
information rather than discarding the contribution from some datum on 
the grounds that it is far away from the desired location or smearing that 
information out according to some linear operator. The essential difference 
between inductive and deductive methods is that the former inverts the 
model to predict the data whereas the latter inverts the data to predict 
the model — see Johnson (2010) for an explicit example regarding sums of 
exponential functions and Johnson (2012) as regards the Fourier transform. 
Having invested in the presumably expensive classification of the particles 
in the training set, it makes sense that one should choose the most reliable 
method for analyzing that data. It is only those of us who generate data 
cheaply that worry about the expense of computing an integral. 

A major advantage of the Bayesian method is that it is extensible. As new 
effects are brought into play, their parameters are included in the analysis 
in a straightforward manner. The language of conditional probability the- 
ory is sufficiently general that it can handle whatever state of background 
knowledge one specifies, including those which admit that competing mod- 
els are available as well as those which admit that the measurements are 
themselves subject to error. "Reasoning in the face of uncertainty" is an apt 
description of inferential logic and is a problem often faced in the imperfect 
real world. The use of Bayes' theorem requires one to specify the conditions 
upon which any and all probabilities are based, which seems like a lot of 
work at the outset but leads to a robust analysis upon completion. It also 
recognizes the difference between the most likely value of some observable 
and its expected value, which is important to consider when not many mea- 
surements are available. By working through the tasks of model selection, 
parameter estimation, and observable prediction explicitly for the case of 
these simple distributions, we hope the reader has learned how to apply the 
MaxEnt method to more complicated cases of data analysis. 
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